NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Graph-constrained analysis for multivariate functional data

https://doi.org/10.1016/j.jmva.2025.105428

Dey, Debangan; Banerjee, Sudipto; Lindquist, Martin A; Datta, Abhirup (May 2025, Journal of Multivariate Analysis)

The manuscript considers multivariate functional data analysis with a known graphical model among the functional variables representing their conditional relationships (e.g., brain region-level fMRI data with a prespecified connectivity graph among brain regions). Functional Gaussian graphical models (GGM) used for analyzing multivariate functional data customarily estimate an unknown graphical model, and cannot preserve knowledge of a given graph. We propose a method for multivariate functional analysis that exactly conforms to a given inter-variable graph. We first show the equivalence between partially separable functional GGM and graphical Gaussian processes (GP), proposed recently for constructing optimal multivariate covariance functions that retain a given graphical model. The theoretical connection helps to design a new algorithm that leverages Dempster’s covariance selection for obtaining the maximum likelihood estimate of the covariance function for multivariate functional data under graphical constraints. We also show that the finite term truncation of functional GGM basis expansion used in practice is equivalent to a low-rank graphical GP, which is known to oversmooth marginal distributions. To remedy this, we extend our algorithm to better preserve marginal distributions while respecting the graph and retaining computational scalability. The benefits of the proposed algorithms are illustrated using empirical experiments and a neuroimaging application.
more » « less
Free, publicly-accessible full text available May 1, 2026
Modeling Multivariate Spatial Dependencies Using Graphical Models

https://doi.org/10.51387/23-NEJSDS47

Dey, Debangan; Datta, Abhirup; Banerjee, Sudipto (September 2023, The New England Journal of Statistics in Data Science)

Graphical models have witnessed significant growth and usage in spatial data science for modeling data referenced over a massive number of spatial-temporal coordinates. Much of this literature has focused on a single or relatively few spatially dependent outcomes. Recent attention has focused upon addressing modeling and inference for substantially large number of outcomes. While spatial factor models and multivariate basis expansions occupy a prominent place in this domain, this article elucidates a recent approach, graphical Gaussian Processes, that exploits the notion of conditional independence among a very large number of spatial processes to build scalable graphical models for fully model-based Bayesian analysis of multivariate spatial data.
more » « less
Full Text Available
nnSVG for the scalable identification of spatially variable genes using nearest-neighbor Gaussian processes

https://doi.org/10.1038/s41467-023-39748-z

Weber, Lukas M.; Saha, Arkajyoti; Datta, Abhirup; Hansen, Kasper D.; Hicks, Stephanie C. (December 2023, Nature Communications)

Abstract Feature selection to identify spatially variable genes or other biologically informative genes is a key step during analyses of spatially-resolved transcriptomics data. Here, we propose nnSVG, a scalable approach to identify spatially variable genes based on nearest-neighbor Gaussian processes. Our method (i) identifies genes that vary in expression continuously across the entire tissue or within a priori defined spatial domains, (ii) uses gene-specific estimates of length scale parameters within the Gaussian process models, and (iii) scales linearly with the number of spatial locations. We demonstrate the performance of our method using experimental data from several technological platforms and simulations. A software implementation is available at https://bioconductor.org/packages/nnSVG .
more » « less
Full Text Available
Scalable Predictions for Spatial Probit Linear Mixed Models Using Nearest Neighbor Gaussian Processes

https://doi.org/10.6339/22-JDS1073

Saha, Arkajyoti; Datta, Abhirup; Banerjee, Sudipto (November 2022, Journal of Data Science)

Spatial probit generalized linear mixed models (spGLMM) with a linear fixed effect and a spatial random effect, endowed with a Gaussian Process prior, are widely used for analysis of binary spatial data. However, the canonical Bayesian implementation of this hierarchical mixed model can involve protracted Markov Chain Monte Carlo sampling. Alternate approaches have been proposed that circumvent this by directly representing the marginal likelihood from spGLMM in terms of multivariate normal cummulative distribution functions (cdf). We present a direct and fast rendition of this latter approach for predictions from a spatial probit linear mixed model. We show that the covariance matrix of the cdf characterizing the marginal cdf of binary spatial data from spGLMM is amenable to approximation using Nearest Neighbor Gaussian Processes (NNGP). This facilitates a scalable prediction algorithm for spGLMM using NNGP that only involves sparse or small matrix computations and can be deployed in an embarrassingly parallel manner. We demonstrate the accuracy and scalability of the algorithm via numerous simulation experiments and an analysis of species presence-absence data.
more » « less
Full Text Available
Laboratory and field evaluation of a low-cost methane sensor and key environmental factors for sensor calibration

https://doi.org/10.1039/d2ea00100d

Lin, Joyce J.; Buehler, Colby; Datta, Abhirup; Gentner, Drew R.; Koehler, Kirsten; Zamora, Misti Levy (April 2023, Environmental Science: Atmospheres)

Low-cost sensors enable finer-scale spatiotemporal measurements within the existing methane (CH 4 ) monitoring infrastructure and could help cities mitigate CH 4 emissions to meet their climate goals. While initial studies of low-cost CH 4 sensors have shown potential for effective CH 4 measurement at ambient concentrations, sensor deployment remains limited due to questions about interferences and calibration across environments and seasons. This study evaluates sensor performance across seasons with specific attention paid to the sensor's understudied carbon monoxide (CO) interferences and environmental dependencies through long-term ambient co-location in an urban environment. The sensor was first evaluated in a laboratory using chamber calibration and co-location experiments, and then in the field through two 8 week co-locations with a reference CH 4 instrument. In the laboratory, the sensor was sensitive to CH 4 concentrations below ambient background concentrations. Different sensor units responded similarly to changing CH 4 , CO, temperature, and humidity conditions but required individual calibrations to account for differences in sensor response factors. When deployed in-field, co-located with a reference instrument near Baltimore, MD, the sensor captured diurnal trends in hourly CH 4 concentration after corrections for temperature, absolute humidity, CO concentration, and hour of day. Variable performance was observed across seasons with the sensor performing well ( R 2 = 0.65; percent bias 3.12%; RMSE 0.10 ppm) in the winter validation period and less accurately ( R 2 = 0.12; percent bias 3.01%; RMSE 0.08 ppm) in the summer validation period where there was less dynamic range in CH 4 concentrations. The results highlight the utility of sensor deployment in more variable ambient CH 4 conditions and demonstrate the importance of accounting for temperature and humidity dependencies as well as co-located CO concentrations with low-cost CH 4 measurements. We show this can be addressed via Multiple Linear Regression (MLR) models accounting for key covariates to enable urban measurements in areas with CH 4 enhancement. Together with individualized calibration prior to deployment, the sensor shows promise for use in low-cost sensor networks and represents a valuable supplement to existing monitoring strategies to identify CH 4 hotspots.
more » « less
Full Text Available
Identifying optimal co-location calibration periods for low-cost sensors

https://doi.org/10.5194/amt-16-169-2023

Levy Zamora, Misti; Buehler, Colby; Datta, Abhirup; Gentner, Drew R.; Koehler, Kirsten (January 2023, Atmospheric Measurement Techniques)

Abstract. Low-cost sensors are often co-located with reference instruments to assess their performance and establish calibration equations, but limiteddiscussion has focused on whether the duration of this calibration period can be optimized. We placed a multipollutant monitor that containedsensors that measured particulate matter smaller than 2.5 µm (PM2.5), carbon monoxide (CO), nitrogendioxide (NO2), ozone (O3), and nitric oxide (NO) at a reference field site for 1 year. We developed calibration equationsusing randomly selected co-location subsets spanning 1 to 180 consecutive days out of the 1-year period and compared the potential root-mean-square error (RMSE) and Pearson correlation coefficient (r) values. The co-located calibration period required to obtain consistent results varied bysensor type, and several factors increased the co-location duration required for accurate calibration, including the response of a sensor toenvironmental factors, such as temperature or relative humidity (RH), or cross-sensitivities to other pollutants. Using measurements fromBaltimore, MD, where a broad range of environmental conditions may be observed over a given year, we found diminishing improvements in the medianRMSE for calibration periods longer than about 6 weeks for all the sensors. The best performing calibration periods were the ones that contained arange of environmental conditions similar to those encountered during the evaluation period (i.e., all other days of the year not used in thecalibration). With optimal, varying conditions it was possible to obtain an accurate calibration in as little as 1 week for all sensors, suggestingthat co-location can be minimized if the period is strategically selected and monitored so that the calibration period is representative of thedesired measurement setting.
more » « less
Full Text Available
Non-linear probabilistic calibration of low-cost environmental air pollution sensor networks for neighborhood level spatiotemporal exposure assessment

https://doi.org/10.1038/s41370-022-00493-y

Patton, Andrew; Datta, Abhirup; Zamora, Misti Levy; Buehler, Colby; Xiong, Fulizi; Gentner, Drew R.; Koehler, Kirsten (November 2022, Journal of Exposure Science & Environmental Epidemiology)

Full Text Available
RandomForestsGLS: An R package for Random Forests fordependent data

https://doi.org/10.21105/joss.03780

Saha, Arkajyoti; Basu, Sumanta; Datta, Abhirup (February 2022, Journal of Open Source Software)

Full Text Available
An illustration of model agnostic explainability methods applied to environmental data

https://doi.org/10.1002/env.2772

Wikle, Christopher K.; Datta, Abhirup; Hari, Bhava Vyasa; Boone, Edward L.; Sahoo, Indranil; Kavila, Indulekha; Castruccio, Stefano; Simmons, Susan J.; Burr, Wesley S.; Chang, Won (February 2023, Environmetrics)

Historically, two primary criticisms statisticians have of machine learning and deep neural models is their lack of uncertainty quantification and the inability to do inference (i.e., to explain what inputs are important). Explainable AI has developed in the last few years as a sub‐discipline of computer science and machine learning to mitigate these concerns (as well as concerns of fairness and transparency in deep modeling). In this article, our focus is on explaining which inputs are important in models for predicting environmental data. In particular, we focus on three general methods for explainability that are model agnostic and thus applicable across a breadth of models without internal explainability: “feature shuffling”, “interpretable local surrogates”, and “occlusion analysis”. We describe particular implementations of each of these and illustrate their use with a variety of models, all applied to the problem of long‐lead forecasting monthly soil moisture in the North American corn belt given sea surface temperature anomalies in the Pacific Ocean.
more » « less
Full Text Available
spNNGP R Package for Nearest Neighbor Gaussian Process Models

https://doi.org/10.18637/jss.v103.i05

Finley, Andrew O.; Datta, Abhirup; Banerjee, Sudipto (January 2022, Journal of Statistical Software)

Full Text Available

« Prev Next »

Search for: All records